Main Analysis (Exploratory Data Analysis)
# dropping rows that have NAs in the hammer price column, which indicates that the lots are not sold
# drop the rows that are from doha, as indicated in the data quality analysis
df1 <- art_df %>% drop_na(hammer_price_bp_usd)
df2 <- df1 %>%
filter(abs(df1$hammer_price_bp_usd -
median(df1$hammer_price_bp_usd)) <=3*sd(df1$hammer_price_bp_usd)) %>%
filter(location %in% c("HONG KONG", "NEW YORK", "LONDON", "PARIS", "MILAN","DUBAI","AMSTERDAM"))
Auction Information
Our research can be divided into three sections: General questions about the data, understanding fluctuations in lot prices and deriving general conclusion.
We started by asking many questions about possible relationships between variables. The first set of plots will explore the correlation between number of lots sold and year, location and season. We were hoping to notice meaningful trends that can be further explored in subsequent sections. Since we had only a few locations, seasons and years we chose a barchart and excluded duplicate rows by the ‘number of lots’ column.
library(gridExtra)
##Lots by location
MyData3 <- subset(art_final, select=c( "location", "number_of_lots" ))
MyData3 <- MyData3[!duplicated(MyData3$number_of_lots),]
MyData4 <- MyData3 %>% group_by(location)%>% summarise(B=sum(number_of_lots))
p1 <- ggplot(MyData4, aes(x= location, y = B)) +
geom_bar( stat='identity', color="blue", fill="grey") +labs(x = "Location")+labs(y = "Total number of lots") + ggtitle("Number of lots sold by location") + theme(plot.title = element_text(hjust = 0.5,face="bold")) + theme_minimal()
MyData_1 <- subset(art_final, select=c( "auc_year", "number_of_lots" ))
MyData_1 <- MyData_1[!duplicated(MyData_1$number_of_lots),]
MyData_1 <- MyData_1 %>% group_by(auc_year)%>% summarise(B=sum(number_of_lots))
p2 <- ggplot(MyData_1, aes(x= auc_year, y = B)) +
geom_bar( stat='identity', color="yellow", fill="grey") +labs(x = "Auction year")+labs(y = "Total number of lots") + ggtitle("Number of lots sold by year") + theme(plot.title = element_text(hjust = 0.5,face="bold")) + theme_minimal()+ scale_x_continuous(breaks= c(2006, 2010, 2014, 2017))
grid.arrange(p1, p2, nrow = 1)
##Lots by season
MyData3 <- subset(art_final, select=c( "auc_season", "number_of_lots" ))
MyData3 <- MyData3[!duplicated(MyData3$number_of_lots),]
MyData4 <- MyData3 %>% group_by(auc_season)%>% summarise(B=sum(number_of_lots))
ggplot(MyData4, aes(x= auc_season, y = B)) +
geom_bar( stat='identity', color="orange", fill="grey") +labs(x = "Auction season")+labs(y = "Total number of lots") + ggtitle("Number of lots sold by season") + theme(plot.title = element_text(hjust = 0.5,face="bold")) + theme_minimal()
Graphs above suggest that fall and spring are the most popular seasons to acquire a masterpiece. At the same time, there was a significant increase in number of lots sold since 2010, suggesting that investors started to see the art market as a form of long-term non-liquid investment after the financial crisis of 2008. The main hubs for art exchanges formed in London, New York and Paris.
Plotting the number of auctions based on location, season and year turned out to have similar results and confirmed ideas suggested above.
##Auctions by location, year, season
MyData3 <- subset(art_final, select=c( "auction_id", "location" ))
MyData3 <- MyData3[!duplicated(MyData3$auction_id),]
MyData4 <- MyData3 %>% group_by(location)%>% count(auction_id) %>% summarise(B=sum(n))
p1<-ggplot(MyData4, aes(x= location, y = B)) +
geom_bar( stat='identity', color="green", fill="grey") +labs(x = "Auction location")+labs(y = "Number of auctions") + ggtitle("Number of auctions by location") + theme(plot.title = element_text(hjust = 0.5,face="bold")) + theme_minimal()
MyData_1 <- subset(art_final, select=c( "auction_id", "auc_year" ))
MyData_1 <- MyData_1[!duplicated(MyData_1$auction_id),]
MyData_1 <- MyData_1 %>% group_by(auc_year)%>% count(auction_id) %>% summarise(B=sum(n))
p2<- ggplot(MyData_1, aes(x= auc_year, y = B)) +
geom_bar( stat='identity', color="blue", fill="grey")+labs(y = "Number of auctions")+labs(x = "Year") + ggtitle("Number of auctions by year") + theme(plot.title = element_text(hjust = 0.5,face="bold")) + theme_minimal()
grid.arrange(p1, p2, nrow = 1)
##Auctions by season
MyData3 <- subset(art_final, select=c( "auction_id", "auc_season" ))
MyData3 <- MyData3[!duplicated(MyData3$auction_id),]
MyData4 <- MyData3 %>% group_by(auc_season)%>% count(auction_id) %>% summarise(B=sum(n))
ggplot(MyData4, aes(x= auc_season, y = B)) +
geom_bar( stat='identity', color="orange", fill="grey") +labs(x = "Auction season")+labs(y = "Number of auctions") + ggtitle("Number of auctions by season") + theme(plot.title = element_text(hjust = 0.5,face="bold")) + theme_minimal()
To see what are the different triggers of variability in art prices we decided to explore fluctuations in Hammer Prices. We created a histogram with a density curve to visualize the distribution of Hammer Price.
##Hammer Price
ggplot(art_final, aes(x= hammer_price_bp_usd)) + geom_histogram(aes(y=..density..)) + geom_density() + xlim(0,1500000) + theme_minimal()
#df2 <- art_final %>%
# filter(abs(art_final$hammer_price_bp_usd -
# median(art_final$hammer_price_bp_usd)) <=3*sd(art_final$hammer_price_bp_usd))
#ggplot(df2, aes(x= hammer_price_bp_usd)) + geom_histogram(aes(y=..density..)) + geom_density() + xlim(0,1500000)+ theme_minimal()
Distribution of Hammer Price is skewed to the right and has a long tail. We believe the reason for this is variability in art geners (Contemporary vs Renaissance for example) sold on the market. Some older masterpieces can be traded at prices that are completly out of range for contemporary artists.
Constracting valuable models in the next parts of our research fully depend on the ability to manipulate hammer price in the right way. We decided to remove a few data points from the tail by considering them as outliers.
##revenue by location
#MyData <- subset(art_final, select=c( "location", "auction_id", "hammer_price_bp_usd" ))
MyData5 <- art_final %>% group_by(location)%>% summarise(B=sum(hammer_price_bp_usd))
p1 <- ggplot(MyData5, aes(x= location, y = B)) +
geom_bar( stat='identity', color="yellow", fill="grey")+labs(y = "Auction Revenue")+labs(x = "Location") + ggtitle("Auction revenue by location") + theme(plot.title = element_text(hjust = 0.5,face="bold")) + theme_minimal()
##revenue by year
MyData_1 <- subset(art_final, select=c( "auc_year", "auction_id", "hammer_price_bp_usd" ))
MyData_1 <- MyData_1 %>% group_by(auc_year)%>% summarise(B=sum(hammer_price_bp_usd))
p2<-ggplot(MyData_1, aes(x= auc_year, y = B)) +
geom_bar(stat='identity', color="green", fill="grey")+labs(y = "Auction Revenue")+labs(x = "Year") + ggtitle("Auction revenue by year") + theme(plot.title = element_text(hjust = 0.5,face="bold")) + theme_minimal()
grid.arrange(p1, p2, nrow = 1)
Adjusted hammer price brought our attention to distribution of revenue over time and location. For both plots we chose a bar chart. In general, auction revenues are going up, with highest revenues in a decade being in 2017. London and New York continue to lead the way as the main centers for the exchange of art.
##Clevelend dot plot
MyData2 <- strtrim(art_final$auc_title, 30)
art_final$auc_title <- MyData2
MyData1 <- art_final[!duplicated(art_final$auc_title),]
MyData3 <- subset(MyData1, select=c("auc_title", "number_of_lots"))
MyData3$auc_title <- factor(MyData3$auc_title, levels = MyData3$auc_title[order(MyData3$number_of_lots)])
MyData3<- MyData3[which(MyData3$number_of_lots>200),]
ggplot(MyData3, aes( x = auc_title, y = number_of_lots)) + geom_point(stat="identity", color="red") + coord_flip()+ theme_minimal()
MyDatas <-subset(art_final, select=c( "nth_in_auction", "auc_year" , "birth_year", "hammer_price_bp_usd", "location"))
library(GGally)
ggparcoord(MyDatas, columns = 1:(ncol(MyDatas)-1), scale = "uniminmax", groupColumn = "location", alphaLines =0.2)
#devtools::install_github("timelyportfolio/parcoords")
library(parcoords)
#library(httpuv)
parcoords(MyDatas,
rownames = F
, brushMode = "1D-axes", color = list(
colorBy = "location", colorScale = htmlwidgets::JS("d3.scale.category10()"))
)
Do Certain Lot Attributes Result in Higher Price?
Lot Titles
What lots have higher price?
df_wordcloud <- df1[,c("lot_title","hammer_price_bp_usd")]
df_wordcloud <- arrange(df_wordcloud,desc(df_wordcloud$hammer_price_bp_usd))[1:500,]
library(wordcloud)
library(tm)
pal <- brewer.pal(9, "OrRd")
pal <- pal[-(1:3)]
wordcloud(df_wordcloud$lot_title, df_wordcloud$hammer_price_bp_usd, min.freq=500, scale=c(5, .5), random.order = FALSE, random.color = FALSE, colors= pal)
#### What words appear more often in the lot titles?
# collapse the lot_title column by word and count the frequency they appear in the titles.
temp <- paste(df1$lot_title, collapse=' ' )
temp <- tolower(temp)
temp <- gsub(" *\\b[[:alpha:]]{1}\\b *", " ", temp)
temp <- gsub('[[:punct:] ]+',' ',temp)
temp <- as.list(strsplit(temp, " "))
temp <- unlist(temp)[!(unlist(temp) %in% stopwords("english"))]
temp <- unlist(temp)[!(unlist(temp) %in% "na")]
word_count <- na.omit(as.data.frame(table(temp)))
word_count <- arrange(word_count,desc(word_count$Freq))[1:300,]
# visualize word frequencies
pal <- brewer.pal(9, "Dark2")
wordcloud(word_count$temp, word_count$Freq, min.freq =20, scale=c(5, .5), random.order = FALSE, random.color = FALSE, colors= pal)
#### Looking at lots that have name “Untitled”, what price ranges are they in? Is it correlated?
library(vcd)
df1 <- df1 %>%
dplyr::mutate(hammer_price_bp_usd_range = forcats::fct_relevel(hammer_price_bp_usd_range, "<$50,000"))%>%
dplyr::mutate(hammer_price_bp_usd_range = forcats::fct_relevel(hammer_price_bp_usd_range, "<$500,000"))%>%
dplyr::mutate(hammer_price_bp_usd_range = forcats::fct_relevel(hammer_price_bp_usd_range, "$500,000+"))
vcd::mosaic(hammer_price_bp_usd_range~is_untitled, direction = c("v", "h"),df1,
gp = gpar(fill = c("lightyellow", "lightpink")),
labeling = labeling_border(rot_labels = c(0, 45),pos_labels="center"))
#### Are they mostly contemprary?
#untitle_labels = read.csv("untitle_ratio.csv", header=TRUE)
Does the era of the lot affect its price?
df3 <- df2 %>%
filter(df2$birth_year>1800)
ggplot(df3, aes(birth_year,hammer_price_bp_usd)) +
geom_smooth(method='lm',formula=y ~ poly(x, 2))+
geom_point(alpha = .1) +
theme_grey(10)+scale_y_log10()+ geom_density_2d(bins = 5)
#box plot price vs year
#ggplot(df3, aes(x=auth_era), y=hammer_price_bp_usd)+ stat_summary(fun.y="mean", geom="line", aes(group=1))
ggplot(df3, aes(auth_era, hammer_price_bp_usd)) +
geom_boxplot()+theme(axis.text.x = element_text(angle = 90, hjust = 1))+scale_y_log10()
## Do Certain Extrernal Factors Result in Higher Price? ###Does the Order Matter?
ggplot(df2, aes(percent_in_auction,hammer_price_bp_usd)) +
geom_smooth(method='lm',formula=y~x,color="red")+
geom_point(alpha = .05) +
theme_grey(10)+scale_y_log10()+
facet_wrap(~location)
### Is there an impact from the financial crisis?
df1$auc_ymd <- as.Date(df1$auc_year_month_date)
art_yearfin <- df1 %>% group_by(month=lubridate::floor_date(auc_ymd, "month")) %>% summarise(revenue = mean(hammer_price_bp_usd))
ggplot() +geom_line(data=art_yearfin, aes(x=month, y=revenue/1000000))+ggtitle("Financial Crisis' Effect on average lot price ($M)")+ylab("average price")+xlab("Time")+theme(axis.text.x = element_text(angle = 45, hjust = 1))+scale_y_log10()
Let’s start by looking at the average lot prices of Sothebeys on a yearly scale. Our guess is that we are supposed to see a significant drop around the time of the financial crisis.
art_finance <- art_final[c("auc_year", "auc_month", "location","hammer_price_bp_usd")] %>% filter(!is.na(hammer_price_bp_usd))
art_yearfin <- art_finance %>% group_by(auc_year) %>% summarise(revenue = sum(hammer_price_bp_usd))
ggplot(art_yearfin, aes(x=auc_year, y=as.numeric(revenue/1000000))) +geom_line(col="blue")+ggtitle("Financial Crisis' Effect on Auction Revenue ($M)")+ylab("Revenue")+xlab("Year")
Indeed, we are seeing a big deep around late 2010. It is surprising to see that it took some time for the effect to reach the auction houses. Perhaps it is certain locations that are skewing some of the data. Let us try to facet the data by location and see if that could present us with a better outlook.
art_locfin <- art_finance %>% group_by(auc_year, location) %>% summarise(revenue = sum(hammer_price_bp_usd))
ggplot(art_locfin, aes(x=auc_year, y=as.numeric(revenue/1000000))) +geom_line(col="blue")+ggtitle("Financial Crisis' Effect on Auction Revenue ($M)")+ylab("Revenue")+xlab("Year") + facet_wrap(~location, scales ="free_y")
It looks like, indeed, some locations do not have consistent data and might be throwing the calculation off (Dubai, Doha, and Amsterdam). Let us tke a look at the same chart without those locations.
art_locfin <- art_finance %>% group_by(auc_year, location) %>% summarise(revenue = sum(hammer_price_bp_usd)) %>% filter(location %in% c("HONG KONG", "NEW YORK", "LONDON", "PARIS", "MILAN"))
ggplot(art_locfin, aes(x=auc_year, y=as.numeric(revenue/1000000))) +geom_line(col="blue")+ggtitle("Financial Crisis' Effect on Auction Revenue ($M)")+ylab("Revenue")+xlab("Year") + facet_wrap(~location, scales ="free_y")
We notice a consistent drop in revenue across all locations above starting 2010. Therefore, our hypothesys must be correct: the financial crisis did have an effect on the auction revenue across the world (specifically significant drops are observed in New York and Hong Kong).
Does Season Matter?
vcd::mosaic(hammer_price_bp_usd_range~auc_season, direction = c("v", "h"),df1,
gp = gpar(fill = c("lightyellow", "lightpink")),
labeling = labeling_border(rot_labels = c(0, 90),pos_labels="center"))
###Does Location Matter? # Executive Summary